smFISH Heteroplasmy Analysis

Author

Abhilesh Dhawanjewar

Published

November 12, 2025

Each NSC gives rise to ~6 progeny cells, and the progeny cells carry less mutant mtDNA than the NSCs they derive from. Heteroplasmy levels also show extensive cell-to-cell variability among progeny, even for cells derived from the same NSC.

Load and preprocess data
library(here)    # project-root-relative paths
library(readxl)  # read_excel()
library(janitor) # clean_names()
library(dplyr)

data_dir <- here("data")
plot_dir <- here("figures")

if (!dir.exists(plot_dir)) {
    dir.create(plot_dir, recursive = TRUE)
}

# Load Excel data; data_dir is already an absolute path, so join with
# file.path() rather than nesting it inside another here() call
cell_volume_data <- read_excel(
    file.path(data_dir, "NSC-progeny cell volume.xlsx")
)
mtDNA_number_data <- read_excel(
    file.path(data_dir, "NSC-progeny FISH mtDNA number.xlsx")
)

# Clean column names
cell_volume_data <- cell_volume_data %>% clean_names()
mtDNA_number_data <- mtDNA_number_data %>% clean_names()

# Give the unnamed first column a descriptive name
cell_volume_data <- cell_volume_data %>% rename("cell_type" = "x1")
# Restore the mtDNA_ prefix that clean_names() lower-cased
mtDNA_number_data <- mtDNA_number_data %>%
    rename_with(~ gsub("mt_dna_", "mtDNA_", .x), starts_with("mt_dna"))

# Colours for plots
NSC_color <- "#7384E9"
progeny_color <- "#97DDEC"
sim_color <- "#E64B35B2"

# Labels for lineages in plots
lineage_labels <- setNames(
    paste("Lineage", sort(unique(mtDNA_number_data$lineage))),
    sort(unique(mtDNA_number_data$lineage))
)

# Calculate heteroplasmy for each observation
mtDNA_number_data <- mtDNA_number_data %>%
    mutate(heteroplasmy = (mtDNA_mut / mtDNA_total) * 100)

Observed volumes and mtDNA copy number

There is an approximate 12-14× reduction in cellular volume from NSCs to progeny cells.

Figure 1: Cellular, cytoplasmic, and nuclear volumes in NSCs and progeny cells. Boxplots show median, interquartile range, and individual data points. NSCs exhibit approximately 12-14× larger volumes than progeny cells across all compartments (Wilcoxon rank-sum test, p < 0.001 for all comparisons).
Figure 2: mtDNA copy number in NSCs and progeny cells. Each dot represents a single cell, colored by lineage. NSCs contain significantly more mtDNA molecules than progeny cells (Wilcoxon rank-sum test, p < 0.001), with approximately 12× reduction consistent with cellular volume changes.

The mtDNA copy number was significantly lower in progeny cells compared to NSCs (Wilcoxon rank-sum test, p = 2.5e-09).
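The group comparisons reported here use the base-R Wilcoxon rank-sum test; a minimal sketch on synthetic counts (sample sizes are hypothetical; means and SDs match the summary table):

```r
# Synthetic stand-ins for the observed copy numbers
set.seed(1)
nsc_counts     <- rnorm(14, mean = 208.7, sd = 68.4)
progeny_counts <- rnorm(84, mean = 16.4,  sd = 7.1)

# Wilcoxon rank-sum (Mann-Whitney) test
wilcox.test(nsc_counts, progeny_counts)

# With the real data, the equivalent call would be:
# wilcox.test(mtDNA_total ~ cell_type, data = mtDNA_number_data)
```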

Comparison of mtDNA copy number and cellular volume between NSCs and progeny cells

| Metric | NSC (Mean ± SD) | Progeny (Mean ± SD) | Fold Reduction |
|---|---|---|---|
| mtDNA Copy Number | 208.7 ± 68.4 | 16.4 ± 7.1 | 12.7× |
| Cellular Volume (μm³) | 115.9 ± 38.8 | 9.5 ± 4.2 | 12.2× |

There is an approximate 12-14× reduction in mtDNA copy number from NSCs to progeny cells, similar to the reduction in cellular volume.

Observed heteroplasmy levels

The variance in the heteroplasmy levels for progeny cells is higher compared to NSCs. The average heteroplasmy levels are lower in progeny cells compared to NSCs.

Figure 3: Heteroplasmy levels in NSCs and individual progeny cells for each of 14 lineages. Each NSC (left) gives rise to ~6 progeny cells (right). Progeny cells show increased variance and generally lower heteroplasmy compared to their parent NSC, with substantial cell-to-cell variability even among progeny from the same NSC.
Comparison of heteroplasmy levels between NSCs and progeny cells

| Metric | NSC (Mean ± SD) | NSC (Median, IQR) | Progeny (Mean ± SD) | Progeny (Median, IQR) | Fold Change | P-value |
|---|---|---|---|---|---|---|
| Heteroplasmy (%) | 66.4 ± 5.7 | 64.5 (IQR: 6.8) | 48.6 ± 11.8 | 50.0 (IQR: 16.8) | 1.37× | 7.6e-07 |

The mean heteroplasmy level was significantly lower in progeny cells compared to NSCs (Wilcoxon rank-sum test, p = 7.6e-07).

Plotting the heteroplasmy levels for all NSCs and progeny cells across all lineages.

Figure 4: Heteroplasmy levels in NSCs and progeny cells across all lineages

Plotting the observed heteroplasmy in NSCs and progeny cells by Lineage

Figure 5: Heteroplasmy distributions in NSCs and progeny cells faceted by lineage. Each panel shows one NSC (left) and its ~6 progeny cells (right). Points are colored by lineage. The shift toward lower heteroplasmy and increased variance in progeny is evident across all lineages, though the magnitude varies.

Mixed-effects models for heteroplasmy variances

To test whether the heteroplasmy variances differ between NSCs and progeny cells, we used linear mixed-effects models (Pinheiro and Bates, 2000) with cell type as a fixed effect and lineage as a random effect. We compared a null model assuming equal variances (Model 0) to an alternative model allowing for different variances in the two cell types (Model 1) using a likelihood ratio test. For indices \(i\) representing cell lineage, \(j\) representing individual observations, \(k\) representing cell type (NSC or Progeny) of observation \(j\) and heteroplasmy \(h_{ij}\), the models are defined as follows:

  • Model 0 (null model): Equal variances \[logit(h_{ij}) = \beta_{0} + \beta_{1} \cdot \text{(cell type)}_{k} + u_{i} + \epsilon_{ij}\]

    where:

    • \(u_i \sim N(0, \sigma_{u}^{2})\) \(\to\) random intercept for lineage \(i\)
    • \(\epsilon_{ij} \sim N(0, \sigma^{2})\) \(\to\) residual error with equal variance across cell types
  • Model 1 (alternative model): Unequal variances \[logit(h_{ij}) = \beta_{0} + \beta_{1} \cdot \text{(cell type)}_{k} + u_{i} + \epsilon_{ij}\]

    where:

    • \(u_i \sim N(0, \sigma_{u}^{2})\) \(\to\) random intercept for lineage \(i\)
    • \(\epsilon_{ij} \sim N(0, \sigma_{k}^{2})\) \(\to\) residual error with variance \(\sigma_{k}^{2}\) specific to cell type \(k\)

Raw heteroplasmy values were logit-transformed and squeezed away from 0 and 1 before model fitting, since the logit is undefined at the boundaries. The squeeze margin \(\epsilon\) was set to \(\frac{1}{2N_{max}}\), where \(N_{max}\) is the largest observed mtDNA copy number, corresponding to half a molecule in the largest pool and thereby avoiding 0% and 100% values while minimizing bias. \[logit(h) = \ln\left(\frac{h}{1-h}\right)\]

Models were fit with Maximum Likelihood (ML) for the likelihood ratio test and with Restricted Maximum Likelihood (REML) for parameter estimation.

Linear mixed-effects modeling of heteroplasmy variances
# Logit transform heteroplasmy to fit normality assumptions
mtDNA_logit_data <- mtDNA_number_data %>%
    dplyr::mutate(
        heteroplasmy_prop = heteroplasmy / 100,
        # Calculate epsilon based on the largest observed mtDNA pool in the data
        epsilon = 1 / (2 * max(mtDNA_total, na.rm = TRUE)),
        heteroplasmy_prop_squeezed = pmax(epsilon, pmin(1 - epsilon, heteroplasmy_prop)),
        logit_heteroplasmy = car::logit(heteroplasmy_prop_squeezed)
    )

# Model 0 (null model): Equal variances
lme_null_model <- lme(
    logit_heteroplasmy ~ cell_type,
    random = ~ 1 | lineage,
    data = mtDNA_logit_data,
    method = "ML"
)

# Model 1 (alternative model): Unequal variances
lme_alt_model <- lme(
    logit_heteroplasmy ~ cell_type,
    random = ~ 1 | lineage,
    weights = varIdent(form = ~ 1 | cell_type),
    data = mtDNA_logit_data,
    method = "ML"
)

# Likelihood ratio test
lrt_result <- anova(
    lme_alt_model,
    lme_null_model
)

lrt_l_ratio <- round(lrt_result$L.Ratio[2], 2)
lrt_p_value <- format(
    lrt_result$`p-value`[2], scientific = FALSE, digits = 4
)

# Summary of the best-fit model
lme_alt_summary <- summary(
    update(lme_alt_model, method = "REML")
)

# Extract coefficients and p-values
tTable <- lme_alt_summary$tTable

beta_progeny <- tTable[
    "cell_typeProgeny", "Value"
    ]
se_progeny <- tTable[
    "cell_typeProgeny", "Std.Error"
    ]
p_progeny <- tTable[
    "cell_typeProgeny", "p-value"
    ]
Figure 6: Diagnostic plots for the linear mixed-effects model on logit-transformed heteroplasmy
Figure 7: Q-Q plot of residuals for the linear mixed-effects model on logit-transformed heteroplasmy

The residuals vs. fitted values plot shows no systematic pattern, indicating that the model assumptions of constant variance and linearity are met. The Q-Q plot demonstrates that the residuals are approximately normally distributed, supporting the validity of the linear mixed-effects model.

Including mtDNA copy number as a covariate did not significantly improve the model fit (LRT p = 0.1479), suggesting that the observed heteroplasmy differences are not driven by differences in mtDNA copy number.
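The copy-number covariate comparison uses the same `nlme` machinery as the models above; a minimal sketch on simulated data (the toy data frame, effect sizes, and seed are illustrative stand-ins, not the study's values):

```r
library(nlme)

# Illustrative data: 14 lineages, 1 NSC + 6 progeny each (values are made up)
set.seed(42)
toy <- data.frame(
    lineage     = factor(rep(1:14, each = 7)),
    cell_type   = rep(c("NSC", rep("Progeny", 6)), times = 14),
    mtDNA_total = as.vector(rbind(rpois(14, 200), matrix(rpois(84, 16), nrow = 6)))
)
toy$logit_het <- rnorm(nrow(toy), mean = ifelse(toy$cell_type == "NSC", 0.7, 0), sd = 0.4)

# Null: cell type only; alternative: cell type + copy number (both ML for LRT)
m0 <- lme(logit_het ~ cell_type, random = ~ 1 | lineage, data = toy, method = "ML")
m1 <- lme(logit_het ~ cell_type + mtDNA_total, random = ~ 1 | lineage, data = toy, method = "ML")

anova(m0, m1)  # likelihood ratio test for the copy-number covariate
```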

Figure 8: Heteroplasmy variance in NSCs and progeny cells. Violin plots show kernel density estimates, with boxplots (median and IQR) and individual data points overlaid. Progeny cells exhibit significantly greater heteroplasmy variance than NSCs (LRT p < 0.001 for unequal variance model), despite having lower mean heteroplasmy. Each point represents a single cell

The linear mixed-effects model on the logit-transformed heteroplasmy data revealed that the odds ratio for progeny cells vs NSC was 0.47 (\(\beta\) = -0.751, SE = 0.069, p < 0.001), indicating that progeny cells had 53% lower odds of carrying mutant mtDNA compared to NSCs. Progeny cells also showed significantly greater heteroplasmy variability than NSCs, as determined by a likelihood ratio test (LRT) comparing the model allowing for different variances to a null model assuming equal variances (LRT = 11.39, p = 0.000738).
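On the logit scale, the fixed effect converts directly to an odds ratio; a short check using the reported estimate and standard error (the Wald-style 95% CI is an illustration, not a value from the model output above):

```r
# Convert the logit-scale fixed effect into an odds ratio
beta <- -0.751   # cell_typeProgeny estimate (from the tTable)
se   <- 0.069    # its standard error

odds_ratio <- exp(beta)
ci <- exp(beta + c(-1.96, 1.96) * se)  # Wald-style 95% CI

round(odds_ratio, 2)  # 0.47, i.e. ~53% lower odds in progeny
round(ci, 2)
```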

Simulations of mtDNA segregation from NSCs to progeny cells

To test whether the observed heteroplasmy variance and shifts in progeny cells could arise purely from stochastic mtDNA partitioning, we simulated heteroplasmy under a model of random drift.

The simulation strategy was as follows:

  1. Simulations were performed independently for each observed cell lineage.
  2. For each lineage, a virtual parent NSC was created containing the same number of wild-type and mutant mtDNA molecules as observed experimentally.
  3. For each progeny cell, a corresponding total mtDNA copy number was assigned based on the experimentally observed values.
  4. A simulated sample of mtDNA molecules was drawn from the parent pool using hypergeometric sampling, modeling partitioning without replacement so that molecules assigned to one progeny cell were removed from the pool and unavailable for subsequent draws.
  5. The heteroplasmy level for each simulated progeny cell was then calculated as the ratio of mutant molecules to the sampled mtDNA total.
  6. The simulation of the entire lineage was repeated 10,000 times to generate a null distribution of heteroplasmy levels for each progeny cell.
Simulation functions for modeling random drift
simulate_lineage_hypergeometric <- function(
    nsc_wt, nsc_mut, observed_totals, n_draws = 999) {
    n_progeny <- length(observed_totals)

    simulated_draws <- replicate(n_draws,
        {
            # Initialize the pool for this single simulation run
            remaining_wt <- nsc_wt
            remaining_mut <- nsc_mut

            # Create empty vectors to store the results for each progeny cell
            wt_samples <- numeric(n_progeny)
            mut_samples <- numeric(n_progeny)

            # Loop through each progeny cell, drawing from and then shrinking the pool
            for (i in seq_len(n_progeny)) {
                # The total mtDNA available for this draw
                current_total_pool <- remaining_wt + remaining_mut

                # If the current pool is empty, we cannot draw any mtDNA
                if (current_total_pool == 0) {
                    wt_samples[i] <- 0
                    mut_samples[i] <- 0
                    next
                }

                # The number of molecules to sample for this progeny cell
                k_draw <- min(observed_totals[i], current_total_pool)

                # Hypergeometric draw without replacement from the current pool
                wt_samples[i] <- rhyper(1,
                    m = remaining_wt, # Number of wild-type mtDNA available
                    n = remaining_mut, # Number of mutant mtDNA available
                    k = k_draw # Number of molecules to sample
                )

                # Calculate the number of mutant mtDNA drawn
                mut_samples[i] <- k_draw - wt_samples[i]

                # Update the remaining pools for the next draw
                remaining_wt <- remaining_wt - wt_samples[i]
                remaining_mut <- remaining_mut - mut_samples[i]
            }

            # Calculate heteroplasmy for each progeny cell in this draw
            heteroplasmy_values <- (mut_samples / observed_totals) * 100
            # Handle cases where observed_totals is 0 to avoid NaN
            heteroplasmy_values[is.nan(heteroplasmy_values)] <- 0

            # Return a data frame with simulated values for this single draw
            data.frame(
                mtDNA_total = observed_totals,
                mtDNA_wt = wt_samples,
                mtDNA_mut = mut_samples,
                heteroplasmy = heteroplasmy_values
            )
        },
        simplify = FALSE
    )
    return(simulated_draws)
}

# Simulate mtDNA distribution in progeny cells for all lineages
simulate_all_lineages <- function(mtDNA_data, n_draws = 999) {
    # Split data by lineage
    lineage_groups <- mtDNA_data %>%
        dplyr::group_by(lineage) %>%
        dplyr::group_split()

    all_simulations <- list()

    for (lineage_data in lineage_groups) {
        # Extract NSC data for this lineage
        nsc_data <- lineage_data %>%
            filter(cell_type == "NSC")

        # Extract progeny data for this lineage
        progeny_data <- lineage_data %>%
            filter(cell_type == "Progeny")

        # Extract mtDNA_wt and mtDNA_mut for NSC
        nsc_wt <- nsc_data$mtDNA_wt
        nsc_mut <- nsc_data$mtDNA_mut

        # Get observed mtDNA_total for progeny cells
        observed_progeny_totals <- progeny_data$mtDNA_total

        # Simulate mtDNA distribution in progeny cells for this lineage
        lineage_simulations <- simulate_lineage_hypergeometric(
            nsc_wt = nsc_wt,
            nsc_mut = nsc_mut,
            observed_totals = observed_progeny_totals,
            n_draws = n_draws
        )

        # Store the simulations
        all_simulations[[as.character(unique(nsc_data$lineage))]] <- lineage_simulations
    }

    return(all_simulations)
}

We tested the convergence of the simulation results by running the simulations with different sample sizes (1,000; 5,000; 10,000; and 20,000) and comparing the resulting heteroplasmy distributions and summary statistics.

Test simulation convergence across different sample sizes
test_sim_convergence <- function() {

    # Set of different n_sims to test
    n_sim_values <- c(1000, 5000, 10000, 20000)

    all_distributions <- list()
    all_summaries <- list()

    for (n in n_sim_values) {

        sims <- simulate_all_lineages(
            mtDNA_data = mtDNA_number_data,
            n_draws = n
        )

        simulated_het_df <- purrr::imap_dfr(
            sims,
            ~ purrr::map_dfr(.x, ~.x, .id = "simulation_id"),
            .id = "lineage"
        )

        all_distributions[[length(all_distributions) + 1]] <- simulated_het_df %>%
            dplyr::select(heteroplasmy) %>%
            mutate(n_sims = factor(
                n,
                levels = n_sim_values,
                labels = paste0(format(n_sim_values, big.mark = ","), " sims")
            ))

        all_summaries[[length(all_summaries) + 1]] <- data.frame(
            n_sims = n,
            mean_het = mean(simulated_het_df$heteroplasmy, na.rm = TRUE),
            sd_het = sd(simulated_het_df$heteroplasmy, na.rm = TRUE),
            median_het = median(simulated_het_df$heteroplasmy, na.rm = TRUE),
            var_het = var(simulated_het_df$heteroplasmy, na.rm = TRUE),
            p25_het = quantile(simulated_het_df$heteroplasmy, 0.25, na.rm = TRUE),
            p75_het = quantile(simulated_het_df$heteroplasmy, 0.75, na.rm = TRUE),
            n_unique = n_distinct(simulated_het_df$heteroplasmy)
        )
    }

    list(
        distributions = do.call(rbind, all_distributions),
        summaries = do.call(rbind, all_summaries)
    )
}

convergence_results <- test_sim_convergence()

convergence_dist_df <- convergence_results$distributions
convergence_summary_df <- convergence_results$summaries
Figure 9: Convergence of heteroplasmy distributions across different simulation sample sizes. Kernel density estimates of heteroplasmy levels from simulations with varying numbers of draws (1,000; 5,000; 10,000; and 20,000) are shown. The distributions overlap closely, indicating convergence of the simulation results as sample size increases.
Convergence of simulation statistics across sample sizes

| N Simulations | Mean Het (%) | SD Het (%) | Median Het (%) | Variance | 25th %ile | 75th %ile | N Unique |
|---|---|---|---|---|---|---|---|
| 1,000 | 66.35 | 13.42 | 66.67 | 180.11 | 57.14 | 75 | 227 |
| 5,000 | 66.34 | 13.38 | 66.67 | 179.09 | 57.14 | 75 | 248 |
| 10,000 | 66.33 | 13.39 | 66.67 | 179.26 | 57.14 | 75 | 252 |
| 20,000 | 66.31 | 13.40 | 66.67 | 179.68 | 57.14 | 75 | 255 |

The summary statistics were stable across all the sample sizes tested, indicating that the simulations achieved convergence. We used a sample size of 10,000 for hypothesis testing.

Monte Carlo test of normalized heteroplasmy variance

We compared the normalized heteroplasmy variance (Wonnapinij et al., 2010) for the observed progeny cells to the null distribution generated from the simulations.

For a cell population with heteroplasmies \((h = h_1, h_2, \ldots, h_n)\), the normalized variance is defined as:

\[V'(h) = \frac{var(h)}{\bar{h}(1 - \bar{h})}\]

where \(\bar{h}\) is the mean heteroplasmy of the population and heteroplasmies are expressed as proportions (0-1).

To test whether the observed variance is consistent with random segregation, we:

  1. Calculated the mean normalized variance across lineages from observed data
  2. Generated a null distribution of mean normalized variances from 10,000 simulations under random hypergeometric partitioning
  3. Performed a two-sided Monte Carlo test comparing the observed statistic to the null distribution

The p-value was calculated as the proportion of simulated values that deviated from the null distribution mean by at least as much as the observed value:

\[p = \frac{1 + \sum_{i=1}^{n} \mathbb{1}\left(|S_i - \bar{S}| \geq |S_{obs} - \bar{S}|\right)}{1 + n}\]

where \(S_{obs}\) is the observed mean normalized variance, \(S_i\) are the simulated values, \(\bar{S}\) is the mean of the null distribution, and \(n = 10{,}000\). This two-tailed test assesses whether the observed variance is unusually high or low compared to random segregation.

Monte Carlo test for normalized heteroplasmy variance
# Function to calculate normalized heteroplasmy variance
compute_normalized_variance <- function(heteroplasmy_values) {
    # Ensure heteroplasmy is on a 0-1 scale (proportion)
    if (max(heteroplasmy_values, na.rm = TRUE) > 1) {
        heteroplasmy_values <- heteroplasmy_values / 100
    }

    mean_h <- mean(heteroplasmy_values, na.rm = TRUE)
    var_h <- var(heteroplasmy_values, na.rm = TRUE)

    # If variance is not defined, or if mean is 0 or 1, normalized variance is 0.
    if (is.na(var_h) || mean_h == 0 || mean_h == 1) {
        return(0)
    }

    normalized_variance <- var_h / (mean_h * (1 - mean_h))
    return(normalized_variance)
}

# Compute observed variances for NSCs and progeny cells
observed_statistic_variance <- mtDNA_number_data %>%
    dplyr::filter(cell_type == "Progeny") %>%
    group_by(lineage) %>%
    summarize(
        norm_var_lineage = compute_normalized_variance(heteroplasmy),
        .groups = "drop"
    ) %>%
    summarize(
        mean_norm_var = mean(norm_var_lineage, na.rm = TRUE)
    ) %>%
    dplyr::pull(mean_norm_var)

# Unnest the simulated results list
simulated_het_df <- purrr::imap_dfr(
    progeny_sims,
    ~ purrr::map_dfr(.x, ~.x, .id = "simulation_id"),
    .id = "lineage"
)

# Null distribution
# For each of the n_sims simulation IDs,
# we calculate the mean normalized variance across all lineages
null_distribution_variance <- simulated_het_df %>%
    group_by(simulation_id, lineage) %>%
    summarize(
        norm_var_lineage = compute_normalized_variance(heteroplasmy),
        .groups = "drop_last" # Drops 'lineage', keeps 'simulation_id'
    ) %>%
    summarize(
        mean_norm_var = mean(norm_var_lineage, na.rm = TRUE)
    ) %>%
    dplyr::pull(mean_norm_var)

# Monte Carlo two-sided p-value
null_mean_variance <- mean(null_distribution_variance, na.rm = TRUE)
p_value_variance <- (
    sum(
        abs(null_distribution_variance - null_mean_variance) >=
        abs(observed_statistic_variance - null_mean_variance)) + 1) / (n_sims + 1)

# Standard Error on the mean for null distribution
se_null_variance <- sd(null_distribution_variance, na.rm = TRUE) / sqrt(length(null_distribution_variance))
Figure 10: Monte Carlo test for normalized heteroplasmy variance in progeny cells. Null distribution generated from 10,000 simulations in red and the observed statistic indicated by the dashed line. Two-tailed Monte Carlo p-value shown.

The p-value for a two-sided Monte Carlo test comparing the observed and simulated normalized heteroplasmy variances was 0.1193, suggesting that the observed variance in progeny cells is not significantly different from that expected under a model of random mtDNA segregation.

The standard error on the mean of the null distribution is 1.33e-04, suggesting that the estimate of the mean normalized variance from the simulations is precise.

Density plots of normalized heteroplasmy variance

Kernel density estimation was performed using the Gaussian kernel with bandwidth determined by Silverman’s rule of thumb (Silverman, 1986), calculated as:

\[bw = 0.9 \times \min(sd, IQR/1.34) \times n^{-1/5}\]

where \(sd\) is the standard deviation, \(IQR\) is the interquartile range, and \(n\) is the sample size.

Histogram binwidths were determined using the Freedman-Diaconis rule: \[w = 2 \times \frac{IQR}{n^{1/3}}\]
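Both rules can be written out directly in base R; `bw_silverman()` below reproduces base R's `bw.nrd()`, and the data vector is arbitrary:

```r
# Silverman's rule of thumb: 0.9 * min(sd, IQR/1.34) * n^(-1/5)
# (identical to base R's bw.nrd())
bw_silverman <- function(x) {
    n <- length(x)
    0.9 * min(sd(x), IQR(x) / 1.34) * n^(-1/5)
}

# Freedman-Diaconis binwidth: 2 * IQR / n^(1/3)
binwidth_fd <- function(x) {
    2 * IQR(x) / length(x)^(1/3)
}

set.seed(42)
x <- rnorm(500)
bw_silverman(x)  # same value as bw.nrd(x)
binwidth_fd(x)
```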

Figure 11: Density plots of normalized heteroplasmy variance for observed and simulated progeny cells. Kernel density estimation performed using Gaussian kernel with bandwidth determined by Silverman's rule of thumb. Histogram binwidths were determined using the Freedman-Diaconis rule.
Density plots of normalized heteroplasmy variance per lineage
Figure 12: Null distributions from 10,000 random segregation simulations (red density plots) for each lineage. The corresponding observed normalized variance is marked by the dashed blue line. Kernel density estimation performed using Gaussian kernel with bandwidth determined by Silverman's rule of thumb.

Monte Carlo test of normalized heteroplasmy shift

We then compared the normalized heteroplasmy shift (Johnston and Jones, 2016) for the observed progeny cells to the null distribution generated from the simulations. The normalized heteroplasmy shift is calculated as:

\[\Delta h' = \ln\left( \frac{h_{Progeny}(1 - h_{NSC})} {(1 - h_{Progeny})h_{NSC}} \right)\]

where \(h_{NSC}\) and \(h_{Progeny}\) represent heteroplasmy as proportions (0-1).

To test whether observed shifts deviate from random segregation:

  1. We calculated the mean heteroplasmy shift per lineage (averaged across ~6 progeny cells per lineage), then took the mean across all 14 lineages as our test statistic
  2. We generated a null distribution from 10,000 simulations of random hypergeometric partitioning, calculating the same statistic for each simulated dataset
  3. We performed a two-sided Monte Carlo test comparing the observed mean shift to the null distribution

The p-value was calculated as the proportion of simulated values that deviated from the null distribution mean by at least as much as the observed value:

\[p = \frac{1 + \sum_{i=1}^{n} \mathbb{1}\left(|S_i - \bar{S}| \geq |S_{obs} - \bar{S}|\right)}{1 + n}\]

where \(S_{obs}\) is the observed mean heteroplasmy shift, \(S_i\) are the simulated values, \(\bar{S}\) is the mean of the null distribution, and \(n = 10{,}000\). This two-tailed test assesses whether the observed shift is unusually large in either direction compared to random segregation.

Monte Carlo test for normalized heteroplasmy shift
# Function to calculate heteroplasmy shift from NSC to progeny cells
compute_heteroplasmy_shift <- function(nsc_heteroplasmy, progeny_heteroplasmy, nsc_cn, progeny_cn) {
    p_nsc <- nsc_heteroplasmy / 100
    p_progeny <- progeny_heteroplasmy / 100

    # Squeeze values to avoid -Inf/Inf at 0 and 1
    # Calculate epsilon based on half a molecule in the largest pool
    epsilon <- 1 / (2 * max(c(nsc_cn, progeny_cn)))
    p_nsc <- pmax(epsilon, pmin(1 - epsilon, p_nsc))
    p_progeny <- pmax(epsilon, pmin(1 - epsilon, p_progeny))

    heteroplasmy_shift <- log(
        (p_progeny * (1 - p_nsc)) /
            ((1 - p_progeny) * p_nsc)
    )

    return(heteroplasmy_shift)
}

observed_nsc_data <- mtDNA_number_data %>%
    filter(cell_type == "NSC") %>%
    select(
        lineage,
        heteroplasmy_nsc = heteroplasmy,
        cn_nsc = mtDNA_total)

# Calculate the observed heteroplasmy shifts
observed_statistic_shift <- mtDNA_number_data %>%
    filter(cell_type == "Progeny") %>%
    left_join(observed_nsc_data, by = "lineage") %>%
    mutate(
        het_shift = compute_heteroplasmy_shift(
            nsc_heteroplasmy = heteroplasmy_nsc,
            progeny_heteroplasmy = heteroplasmy,
            nsc_cn = cn_nsc,
            progeny_cn = mtDNA_total)
    ) %>%
    group_by(lineage) %>%
    summarize(
        mean_het_shift_lineage = mean(het_shift, na.rm = TRUE),
        .groups = "drop"
    ) %>%
    summarize(
        mean_het_shift = mean(mean_het_shift_lineage, na.rm = TRUE)
    ) %>%
    pull(mean_het_shift)

simulated_het_shifts_df <- simulated_het_df %>%
    dplyr::mutate(
        lineage = as.numeric(lineage)) %>%
    dplyr::left_join(observed_nsc_data, by = "lineage") %>%
    dplyr::mutate(
        het_shift = compute_heteroplasmy_shift(
            nsc_heteroplasmy = heteroplasmy_nsc,
            progeny_heteroplasmy = heteroplasmy,
            nsc_cn = cn_nsc,
            progeny_cn = mtDNA_total)
    )

# Null distribution at the lineage level
null_distribution_shift <- simulated_het_shifts_df %>%
    group_by(simulation_id, lineage) %>%
    summarize(
        mean_het_shift_lineage = mean(het_shift, na.rm = TRUE),
        .groups = "drop_last"
    ) %>%
    summarize(
        mean_het_shift = mean(mean_het_shift_lineage, na.rm = TRUE),
        .groups = "drop"
    ) %>%
    pull(mean_het_shift)

# Monte Carlo two-sided p-value
null_mean_shift <- mean(null_distribution_shift, na.rm = TRUE)
p_value_shift <- (
    sum(
        abs(null_distribution_shift - null_mean_shift) >=
        abs(observed_statistic_shift - null_mean_shift)) + 1) / (n_sims + 1)

# Standard Error on the mean for null distribution
se_null_shift <- sd(null_distribution_shift, na.rm = TRUE) / sqrt(length(null_distribution_shift))
Figure 13: Monte Carlo test for normalized heteroplasmy shift in progeny cells. Null distribution generated from 10,000 simulations in red and the observed statistic indicated by the dashed line. Two-tailed Monte Carlo p-value shown.

The p-value for a two-sided Monte Carlo test comparing the observed and simulated mean heteroplasmy shifts was 1e-04, suggesting that the observed shift in progeny cells is significantly different from that expected under a model of random mtDNA segregation.

The standard error on the mean of the null distribution is 9.25e-04, suggesting that the estimate of the mean heteroplasmy shift from the simulations is precise.

Density plots of heteroplasmy shifts

Kernel density estimation and histogram binwidths were determined as described above, using a Gaussian kernel with bandwidth from Silverman's rule of thumb (Silverman, 1986) and the Freedman-Diaconis rule, respectively.

Figure 14: Density plots of heteroplasmy shifts for observed and simulated progeny cells. Kernel density estimation performed using Gaussian kernel with bandwidth determined by Silverman's rule of thumb. Histogram binwidths were determined using Freedman-Diaconis rule.
Density plots of heteroplasmy shifts per lineage
Figure 15: Density plots of the observed progeny heteroplasmy (light blue) overlaid on the null distribution from 10,000 random segregation simulations (red) for each lineage. The heteroplasmy of the parent NSC for each lineage is marked by the dashed line. Kernel density estimation performed using Gaussian kernel with bandwidth = 2.5 times Silverman's rule of thumb, scaled to increase smoothing.

Approximate Bayesian Computation (ABC) analysis

To formally test whether selection is acting during neurogenesis, we performed an Approximate Bayesian Computation (ABC) analysis (Beaumont et al., 2002), a simulation-inference framework where likelihoods are approximated by comparing observed and simulated summary statistics.

The ABC workflow involved:

  1. Summary statistics selection: Identifying the most informative features of the data
  2. Model selection to determine whether selection or pure drift better explains the observed patterns
  3. Parameter estimation to quantify the selection coefficient (s) against mutant mtDNA

Simulation Models

We compared two alternative models of mtDNA inheritance:

Model 1: Pure Bottleneck (s = 0): mtDNA molecules are randomly partitioned to progeny cells through hypergeometric sampling without replacement. Each progeny sequentially draws mtDNA molecules from the NSC pool, which updates after each draw.

Model 2: Selection + Bottleneck (s ≠ 0): Selection modifies mtDNA frequencies before partitioning. If \(p\) is the initial mutant mtDNA proportion in the NSC, the post-selection proportion \(p'\) is given by:

\[p' = \frac{p(1+s)}{p(1+s) + (1-p)}\]

where \(s\) is the selection coefficient. Negative \(s\) values indicate purifying selection against mutant mtDNA. After selection, segregation proceeds as in Model 1.

Both models used observed NSC counts and progeny copy numbers for each lineage, ensuring simulations matched the actual bottleneck severity and starting heteroplasmy.

Model definitions for ABC analysis
# Model 1: Random segregation; no selection (s = 0)
# Same algorithm as the simulate_lineage_hypergeometric() function
simulate_pure_bottleneck <- function(nsc_wt, nsc_mut, observed_totals) {

    # Update mtDNA pool after each draw
    remaining_wt <- nsc_wt
    remaining_mut <- nsc_mut
    mut_samples <- numeric(length(observed_totals))

    for (i in seq_along(observed_totals)) {
        current_pool <- remaining_wt + remaining_mut
        if (current_pool == 0) {
            mut_samples[i] <- 0
            next
        }
        k_draw <- min(observed_totals[i], current_pool)
        drawn_wt <- rhyper(
            1,
            m = remaining_wt,
            n = remaining_mut,
            k = k_draw
        )
        mut_samples[i] <- k_draw - drawn_wt
        remaining_wt <- remaining_wt - drawn_wt
        remaining_mut <- remaining_mut - mut_samples[i]
    }

    # Handle cases where observed_totals is 0 to avoid NaN
    het_values <- (mut_samples / observed_totals) * 100
    het_values[is.nan(het_values)] <- 0
    return(het_values)
}

# Model 2: Selection before division (s != 0)
simulate_selection <- function(nsc_wt, nsc_mut, observed_totals, s) {

    total_pool <- nsc_wt + nsc_mut
    het_nsc <- nsc_mut / total_pool
    w_mut <- 1 + s
    freq_mut_eff <- (het_nsc * w_mut) / (het_nsc * w_mut + (1 - het_nsc))
    eff_mut <- round(freq_mut_eff * total_pool)
    eff_mut <- max(0, min(eff_mut, total_pool))
    eff_wt <- total_pool - eff_mut

    return(simulate_pure_bottleneck(
        nsc_wt = eff_wt,
        nsc_mut = eff_mut,
        observed_totals = observed_totals
    ))
}

Summary statistics selection for ABC

We evaluated 14 candidate summary statistics using Random Forest variable importance measures (Raynal et al., 2019). For each lineage, we simulated 5,000 datasets with \(s \sim \mathcal{U}(-1, 0.1)\), calculated all summary statistics, trained a regression ABC-RF model to predict \(s\) from the statistics, and extracted variable importance scores.
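The reference table and ABC-RF fit can be sketched as follows. The toy summary-statistic columns below are placeholders (in the analysis they come from `simulate_selection()` and `calc_summary_stats()` per lineage), and `abcrf::regAbcrf()` is the regression ABC-RF of Raynal et al. (2019):

```r
# Toy ABC reference table: draw s from its prior, then attach summary
# statistics; these columns are illustrative placeholders only
set.seed(7)
n_ref   <- 500
s_prior <- runif(n_ref, min = -1, max = 0.1)

ref_table <- data.frame(
    s        = s_prior,
    mean_het = 66 + 30 * s_prior + rnorm(n_ref, sd = 3),
    var_het  = 180 + rnorm(n_ref, sd = 20)
)

# Regression ABC-RF: predict s from the summary statistics; the fitted
# forest carries the variable importance scores used for statistic selection
if (requireNamespace("abcrf", quietly = TRUE)) {
    rf_model <- abcrf::regAbcrf(s ~ ., data = ref_table, ntree = 500)
}
```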

Summary statistics calculation and selection for ABC analysis
# Calculate all summary statistics for initial testing
calc_summary_stats <- function(progeny_het, nsc_het, nsc_cn, progeny_cn, stats_list) {
    # Return NA values if data is insufficient for calculations
    if (all(is.na(progeny_het)) || length(progeny_het) < 2) {
        return(setNames(rep(NA, length(stats_list)), stats_list))
    }

    # Calculate heteroplasmy shifts using the robust, data-driven epsilon method
    het_shifts <- compute_heteroplasmy_shift(rep(nsc_het, length(progeny_het)), progeny_het, rep(nsc_cn, length(progeny_het)), progeny_cn)

    # Squeeze values for skewness/kurtosis to avoid issues at the boundaries
    progeny_het_squeezed <- pmax(0.001, pmin(99.999, progeny_het))

    # Initialize an empty list to store results
    results <- list()

    # Dynamically calculate only the requested statistics
    for (stat in stats_list) {
        results[[stat]] <- switch(
            stat,
            "mean_het" = mean(progeny_het, na.rm = TRUE),
            "var_het" = var(progeny_het, na.rm = TRUE),
            "median_het" = median(progeny_het, na.rm = TRUE),
            "mean_shift" = mean(het_shifts, na.rm = TRUE),
            "var_shift" = var(het_shifts, na.rm = TRUE),
            "min_het" = min(progeny_het, na.rm = TRUE),
            "max_het" = max(progeny_het, na.rm = TRUE),
            "q25_het" = {
                q <- quantile(progeny_het, 0.25, na.rm = TRUE)
                names(q) <- NULL  # Remove the "25%" name
                q
            },
            "q75_het" = {
                q <- quantile(progeny_het, 0.75, na.rm = TRUE)
                names(q) <- NULL  # Remove the "75%" name
                q
            },
            "skew_het" = moments::skewness(progeny_het_squeezed, na.rm = TRUE),
            "kurt_het" = moments::kurtosis(progeny_het_squeezed, na.rm = TRUE),
            "mad_het" = mad(progeny_het, na.rm = TRUE),
            "prop_below_nsc" = mean(progeny_het < nsc_het, na.rm = TRUE),
            "norm_var" = compute_normalized_variance(progeny_het),
            NA  # Default case if the stat is not recognized
        )
    }

    # Return the results as a named vector
    return(unlist(results))
}

select_summary_stats <- function(
    mtDNA_data,
    stats_list = c(
        "mean_het", "var_het", "median_het", "mean_shift", "var_shift",
        "min_het", "max_het", "q25_het", "q75_het", "skew_het",
        "kurt_het", "mad_het", "prop_below_nsc", "norm_var"
    )) {

    # Number of simulations for testing
    n_rf_sims <- 5000

    all_lineage_var_imp_df <- purrr::map_dfr(unique(mtDNA_data$lineage), ~{

        lineage_id <- .x

        # Filter data for current lineage
        lineage_data <- dplyr::filter(mtDNA_data, lineage == lineage_id)
        nsc_data <- dplyr::filter(lineage_data, cell_type == "NSC")
        progeny_data <- dplyr::filter(lineage_data, cell_type == "Progeny")

        # Generate the reference table for this lineage
        # Prior distribution for s
        prior_s_rf <- runif(n_rf_sims, min = -1, max = 0.1)

        summary_stats_matrix <- t(sapply(prior_s_rf, function(s) {

            # Run simulations under selection model
            sim_het <- simulate_selection(
                nsc_wt = nsc_data$mtDNA_wt,
                nsc_mut = nsc_data$mtDNA_mut,
                observed_totals = progeny_data$mtDNA_total,
                s = s
            )

            # Calculate summary statistics for the simulated data
            calc_summary_stats(
                progeny_het = sim_het,
                nsc_het = nsc_data$heteroplasmy,
                nsc_cn = nsc_data$mtDNA_total,
                progeny_cn = progeny_data$mtDNA_total,
                stats_list = stats_list
            )
        }))

        ref_table <- data.frame(
            s = prior_s_rf,
            as.data.frame(summary_stats_matrix)
        )

        # Filter out NA values
        ref_table <- ref_table %>% dplyr::filter(complete.cases(.))

        # Train the regABCRF model
        abc_rf_model <- regAbcrf(
            s ~ .,
            data = ref_table,
            ntree = 500,
            paral = TRUE
        )

        # Extract variable importance
        (data.frame(
            lineage = lineage_id,
            Statistic = names(abc_rf_model$model.rf$variable.importance),
            Importance = abc_rf_model$model.rf$variable.importance
        ))
    })

    return(all_lineage_var_imp_df)

}

var_importance_res <- select_summary_stats(mtDNA_number_data)

selected_stats <- var_importance_res %>%
    dplyr::group_by(Statistic) %>%
    summarize(MedianImportance = median(Importance, na.rm = TRUE)) %>%
    arrange(desc(MedianImportance)) %>%
    slice_head(n = 5) %>%
    dplyr::pull(Statistic) %>%
    as.character()
Figure 16: Variable importance of 14 candidate summary statistics for ABC parameter estimation. Box plots show Random Forest variable importance scores across all 14 lineages, ranked by median importance. The top 5 statistics (mean heteroplasmy, normalized heteroplasmy shift, median heteroplasmy, and 25th and 75th percentiles of the heteroplasmy distribution) were selected for downstream ABC analysis based on their discriminatory power for estimating selection coefficients.

Based on the variable importance results, we selected the top five summary statistics for ABC model selection and parameter estimation:

Top 5 summary statistics selected for ABC inference
| Rank | Summary Statistic | Description | Median Importance |
|------|-------------------|-------------|-------------------|
| 1 | Mean heteroplasmy | Mean heteroplasmy across progeny | 142.5 |
| 2 | Normalized heteroplasmy shift | Mean change in heteroplasmy from NSC to progeny | 103.5 |
| 3 | Median heteroplasmy | Median heteroplasmy across progeny | 81.1 |
| 4 | 75th percentile | 75th percentile of progeny heteroplasmy | 55.5 |
| 5 | 25th percentile | 25th percentile of progeny heteroplasmy | 47.4 |

ABC Model Selection using ABC-RF

We used ABC Random Forest (ABC-RF) classification (Pudlo et al., 2016) to compare the two models (pure drift vs. selection + drift). For each lineage, we simulated 10,000 datasets (5,000 per model), calculated the five selected summary statistics, and trained an ABC-RF classifier to predict the model from the statistics. We then predicted the best model for the observed data and calculated posterior probabilities and Bayes factors (\(\text{BF} = P(\text{selection})/P(\text{drift})\)).
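The mapping from a two-model posterior probability to a Bayes factor is just an odds transform. A minimal sketch (the function name `posterior_to_bf` is ours), using a 0.001/0.999 cap so the ratio stays finite:

```r
# Convert P(selection) into BF = P(selection) / P(drift), capping the
# probability at [0.001, 0.999] so the ratio never becomes infinite
posterior_to_bf <- function(prob_selection) {
    p <- pmin(pmax(prob_selection, 0.001), 0.999)
    p / (1 - p)
}

posterior_to_bf(0.9)    # BF of ~9
posterior_to_bf(0.999)  # BF capped at ~999
```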

ABC model choice using ABC-RF
abcrf_model_choice <- function(
    mtDNA_data,
    selected_stats,
    n_particles = 10000) {

    all_lineages <- unique(mtDNA_data$lineage)
    all_models <- list()

    model_choice_results <- map_dfr(all_lineages, function(lineage_num) {
        lineage_data <- filter(mtDNA_data, lineage == lineage_num)
        nsc_data <- filter(lineage_data, cell_type == "NSC")
        progeny_data <- filter(lineage_data, cell_type == "Progeny")

        # Calculate summary statistics
        target_stats <- calc_summary_stats(
            progeny_het = progeny_data$heteroplasmy,
            nsc_het = nsc_data$heteroplasmy,
            nsc_cn = nsc_data$mtDNA_total,
            progeny_cn = progeny_data$mtDNA_total,
            stats_list = selected_stats
        )

        # Return NA if invalid
        if (any(is.na(target_stats))) {
            warning(sprintf("Lineage %d: NA in summary statistics", lineage_num))
            return(data.frame(
                lineage = lineage_num,
                model_selected = NA,
                prob_drift = NA,
                prob_selection = NA,
                bayes_factor = NA,
                log_bf = NA,
                evidence_strength = "Failed"
            ))
        }

        # Simulate and calculate summary stats
        # s is drawn internally for the selection model, so it is not a parameter
        simulate_stats <- function(model_type) {
            if (model_type == "drift") {
                sim_het <- simulate_pure_bottleneck(
                    nsc_wt = nsc_data$mtDNA_wt,
                    nsc_mut = nsc_data$mtDNA_mut,
                    observed_totals = progeny_data$mtDNA_total
                )
            } else if (model_type == "selection") {
                s <- runif(1, -1, 0.1)
                sim_het <- simulate_selection(
                    nsc_wt = nsc_data$mtDNA_wt,
                    nsc_mut = nsc_data$mtDNA_mut,
                    observed_totals = progeny_data$mtDNA_total,
                    s = s
                )
            }

            calc_summary_stats(
                progeny_het = sim_het,
                nsc_het = nsc_data$heteroplasmy,
                nsc_cn = nsc_data$mtDNA_total,
                progeny_cn = progeny_data$mtDNA_total,
                stats_list = selected_stats
            )
        }

        # Generate reference table
        n_sims_per_model <- n_particles / 2

        drift_sims <- replicate(n_sims_per_model, {
            simulate_stats("drift")
        }, simplify = FALSE)

        selection_sims <- replicate(n_sims_per_model, {
            simulate_stats("selection")
        }, simplify = FALSE)

        ref_table <- data.frame(
            rbind(
                do.call(rbind, drift_sims),
                do.call(rbind, selection_sims)
            ),
            model = factor(
                rep(c("drift", "selection"), each = n_sims_per_model)
            )
        )

        ref_table <- ref_table[complete.cases(ref_table), ]

        # Train ABC Random Forest
        abcrf_model <- abcrf(
            model ~ .,
            data = ref_table,
            ntree = 500,
            paral = FALSE
        )

        all_models[[as.character(lineage_num)]] <<- abcrf_model

        target_obs <- as.data.frame(t(target_stats))
        colnames(target_obs) <- selected_stats

        prediction <- predict(
            abcrf_model,
            obs = target_obs,
            training = ref_table,
            ntree = 500,
            paral = FALSE
        )

        # Extract results with Inf protection. Note: post.prob is the
        # posterior probability of the *selected* model, so map it onto
        # P(selection) explicitly before forming the Bayes factor
        prob_best <- pmin(pmax(prediction$post.prob, 0.001), 0.999)
        prob_selection <- ifelse(
            as.character(prediction$allocation) == "selection",
            prob_best,
            1 - prob_best
        )
        prob_drift <- 1 - prob_selection
        bayes_factor <- prob_selection / prob_drift

        data.frame(
            lineage = lineage_num,
            model_selected = as.character(prediction$allocation),
            prob_drift = prob_drift,
            prob_selection = prob_selection,
            bayes_factor = bayes_factor,
            log_bf = log(bayes_factor)
        )
    })

    return(list(results = model_choice_results, models = all_models))
}

# Run abcrf model choice
abc_output <- abcrf_model_choice(
    mtDNA_number_data,
    selected_stats,
    n_particles = 10000
)

model_choice_results <- abc_output$results
abcrf_models <- abc_output$models
Figure 17: ABC-RF classification performance across lineages. Black line and points show out-of-bag (OOB) error rates for each lineage-specific random forest classifier. Red and blue points indicate model-specific misclassification rates for drift and selection models, respectively.

The ABC-RF classifiers demonstrated good discriminatory power between the drift and selection models, achieving a mean out-of-bag (OOB) error of 19.6% (SD: 4.1%, range: 11.9%-25.5%) across all lineages. Class-specific misclassification rates were asymmetric: 12.2% for drift versus 27.1% for selection. The classifier therefore misclassifies selection as drift more often than the reverse, a conservative behavior that minimizes false positives for selection.

Figure 18: Posterior probabilities for selection vs. drift models across all lineages. Stacked bars show the relative support for selection + bottleneck (blue) versus pure drift (red) models. All 14 lineages favor the selection model, with 13/14 showing P(Selection) > 0.9, indicating strong evidence for selection against mutant mtDNA during NSC division.

All lineages showed stronger support for the selection + bottleneck model compared to pure drift, with posterior probabilities for selection ranging from 0.713 to 0.999, and 13 out of 14 lineages exceeding 0.9. This result is particularly robust given the classifier’s conservative tendency to misclassify selection as drift, indicating that the selection signal in the data is sufficiently strong to overcome this bias.

ABC Parameter Estimation using SMC-ABC

We estimated selection coefficients using Sequential Monte Carlo ABC (Lenormand et al., 2013), an adaptive algorithm that iteratively refines posterior distributions through sequential resampling. We specified a uniform prior, \(s \sim \mathcal{U}(-1, 0.1)\), for the selection coefficient. This range spans strong purifying selection (\(s = -1\)) through weak positive selection (\(s = +0.1\)), with the asymmetry reflecting the observed reduction in heteroplasmy from NSCs to progeny cells. We used 1,500 particles per lineage, with the acceptance threshold declining adaptively from 50% to 2%. For each proposed \(s\), we simulated progeny heteroplasmy using the observed NSC mtDNA counts and calculated the five selected summary statistics.

SMC-ABC parameter estimation for selection coefficient (s)
# SMC-ABC parameter estimation (for individual lineages)
abc_parameter_estimation <- function(
    lineage_data,
    stats_list,
    n_particles = 1500) {

        # Extract NSC and progeny data
        progeny_het <- lineage_data$heteroplasmy[lineage_data$cell_type == "Progeny"]
        nsc_het <- lineage_data$heteroplasmy[lineage_data$cell_type == "NSC"]
        nsc_wt <- lineage_data$mtDNA_wt[lineage_data$cell_type == "NSC"]
        nsc_mut <- lineage_data$mtDNA_mut[lineage_data$cell_type == "NSC"]
        nsc_cn <- lineage_data$mtDNA_total[lineage_data$cell_type == "NSC"]
        progeny_cn <- lineage_data$mtDNA_total[lineage_data$cell_type == "Progeny"]

        # Simulation function for SMC-ABC
        simulate_progeny <- function(params)  {
            s <- params[1]

            sim_het <- simulate_selection(
                nsc_wt = nsc_wt,
                nsc_mut = nsc_mut,
                observed_totals = progeny_cn,
                s = s
            )

            calc_summary_stats(
                progeny_het = sim_het,
                nsc_het = nsc_het,
                nsc_cn = nsc_cn,
                progeny_cn = progeny_cn,
                stats_list = stats_list
            )
        }

        # Define prior distributions
        prior_list <- list(
                c("unif", -1, 0.1)  # Prior for s
            )

        # Calculate target statistics
        target_stats <- calc_summary_stats(
            progeny_het = progeny_het,
            nsc_het = nsc_het,
            nsc_cn = nsc_cn,
            progeny_cn = progeny_cn,
            stats_list = stats_list
        )

        # Run SMC-ABC
        smc_result <- try(EasyABC::ABC_sequential(
            method = "Lenormand",
            model = simulate_progeny,
            prior = prior_list,
            summary_stat_target = target_stats,
            nb_simul = n_particles,
            p_acc_min = 0.02,
            alpha = 0.5,
            verbose = FALSE
        ), silent = TRUE)

        if (inherits(smc_result, "try-error")) {
            warning("SMC-ABC failed for lineage ", unique(lineage_data$lineage))
            return(NULL)
        }

        return(as.data.frame(smc_result$param))
}

# Run SMC-ABC on all lineages
run_smc_abc_all_lineages <- function(
    mtDNA_data,
    stats_list,
    n_particles = 1500) {

    all_lineages <- unique(mtDNA_data$lineage)
    all_posteriors <- list()

    results <- map_dfr(all_lineages, function(lineage_num) {
        lineage_data <- filter(mtDNA_data, lineage == lineage_num)

        progeny_het <- lineage_data$heteroplasmy[lineage_data$cell_type == "Progeny"]
        nsc_het <- lineage_data$heteroplasmy[lineage_data$cell_type == "NSC"]
        nsc_cn <- lineage_data$mtDNA_total[lineage_data$cell_type == "NSC"]
        progeny_cn <- lineage_data$mtDNA_total[lineage_data$cell_type == "Progeny"]

        # Calculate target statistics
        target_stats <- calc_summary_stats(
            progeny_het = progeny_het,
            nsc_het = nsc_het,
            nsc_cn = nsc_cn,
            progeny_cn = progeny_cn,
            stats_list = stats_list
        )

        # Check data quality
        if (any(is.na(target_stats)) || length(progeny_het) < 3) {
            all_posteriors[[as.character(lineage_num)]] <<- NULL

            return(data.frame(
                lineage = lineage_num,
                s_median = NA,
                s_ci_lower = NA,
                s_ci_upper = NA,
                success = FALSE
            ))
        }

        # Run SMC-ABC
        posterior <- abc_parameter_estimation(
            lineage_data,
            stats_list,
            n_particles
        )

        # Store posterior samples and return results
        if (!is.null(posterior) && nrow(posterior) > 0) {
            all_posteriors[[as.character(lineage_num)]] <<- posterior

            data.frame(
                lineage = lineage_num,
                s_median = median(posterior$V1),
                s_ci_lower = quantile(posterior$V1, 0.025),
                s_ci_upper = quantile(posterior$V1, 0.975),
                n_posterior = nrow(posterior),
                success = TRUE
            )
        } else {
            all_posteriors[[as.character(lineage_num)]] <<- NULL

            data.frame(
                lineage = lineage_num,
                s_median = NA,
                s_ci_lower = NA,
                s_ci_upper = NA,
                n_posterior = NA,
                success = FALSE
            )
        }
    })

    # Return both results and posteriors
    return(list(
        results = results,
        posteriors = all_posteriors
    ))
}

smc_abc_results <- run_smc_abc_all_lineages(
    mtDNA_number_data,
    selected_stats,
    n_particles = 1500
)

# Add additional metrics
smc_abc_results$results <- smc_abc_results$results %>%
    mutate(
        s_ci_width = s_ci_upper - s_ci_lower,
        significant_selection = s_ci_upper < 0 | s_ci_lower > 0,
        selection_strength = case_when(
            !success ~ "Failed",
            is.na(s_median) ~ "Failed",
            s_median > -0.1 ~ "Weak/None",
            s_median > -0.3 ~ "Moderate",
            s_median > -0.6 ~ "Strong",
            TRUE ~ "Very Strong"
        )
    )
Figure 19: Selection coefficient estimates for all lineages from Sequential Monte Carlo ABC. Points show posterior median estimates with 95% credible intervals (horizontal bars). All lineages show negative selection coefficients (s < 0), indicating purifying selection against mutant mtDNA. The red dashed line at s = 0 represents neutral evolution.
Figure 20: Posterior distributions of selection coefficients (s) for all lineages. Ridge plots show the full posterior density from SMC-ABC, colored by log₁₀(Bayes Factor) from model selection analysis. Darker colors indicate stronger evidence for selection over drift. All distributions are centered below s = 0 (red dashed line), confirming purifying selection against mutant mtDNA.
Summary statistics for estimated selection coefficients across all lineages

| Metric | Value |
|--------|-------|
| Number of lineages analyzed | 14 |
| Lineages with significant selection | 14 (100.0%) |
| Median selection coefficient | -0.510 |
| Mean selection coefficient (± SD) | -0.502 ± 0.099 |
| Range | -0.637 to -0.320 |
| Interquartile range (Q25-Q75) | -0.565 to -0.438 |

The median selection coefficient across all lineages was -0.51, with 14 out of 14 lineages showing significant evidence for negative selection (95% credible interval excluding zero).

Parameter recovery validation for SMC-ABC
# Parameter recovery validation
abc_parameter_recovery <- function(mtDNA_data, stats_list, n_tests = 20) {

    # Test across range of selection strengths
    true_s_values <- seq(-0.8, -0.2, length.out = n_tests)

    # Use representative lineage as template
    template_lineage <- mtDNA_data %>% filter(lineage == 1)
    template_nsc <- template_lineage %>% filter(cell_type == "NSC")
    template_progeny <- template_lineage %>% filter(cell_type == "Progeny")

    recovery_results <- map_dfr(true_s_values, function(s_true) {

        # Simulate data with known s
        sim_progeny_het <- simulate_selection(
            nsc_wt = template_nsc$mtDNA_wt,
            nsc_mut = template_nsc$mtDNA_mut,
            observed_totals = template_progeny$mtDNA_total,
            s = s_true
        )

        # Create fake lineage with simulated data
        fake_lineage <- template_lineage
        fake_lineage$lineage <- 999
        fake_lineage$heteroplasmy[fake_lineage$cell_type == "Progeny"] <- sim_progeny_het

        # Estimate s using SMC-ABC
        result <- tryCatch({
            abc_parameter_estimation(
                lineage_data = fake_lineage,
                stats_list = stats_list,
                n_particles = 1000
            )
        }, error = function(e) NULL)

        if (!is.null(result) && nrow(result) > 0) {
            s_median <- median(result$V1)
            s_ci_lower <- quantile(result$V1, 0.025)
            s_ci_upper <- quantile(result$V1, 0.975)

            data.frame(
                s_true = s_true,
                s_estimated = s_median,
                s_ci_lower = s_ci_lower,
                s_ci_upper = s_ci_upper,
                in_ci = (s_true >= s_ci_lower & s_true <= s_ci_upper),
                bias = s_median - s_true,
                abs_error = abs(s_median - s_true)
            )
        } else {
            data.frame(
                s_true = s_true,
                s_estimated = NA,
                s_ci_lower = NA,
                s_ci_upper = NA,
                in_ci = NA,
                bias = NA,
                abs_error = NA
            )
        }
    })

    return(recovery_results)
}

# Run parameter recovery
recovery_results <- abc_parameter_recovery(
    mtDNA_data = mtDNA_number_data,
    stats_list = selected_stats,
    n_tests = 20
)

# Calculate summary statistics
recovery_summary <- recovery_results %>%
    filter(!is.na(s_estimated)) %>%
    summarize(
        n_tests = n(),
        coverage = mean(in_ci, na.rm = TRUE) * 100,
        mean_bias = mean(bias, na.rm = TRUE),
        mean_abs_error = mean(abs_error, na.rm = TRUE),
        rmse = sqrt(mean(bias^2, na.rm = TRUE))
    )
Figure 21: Parameter recovery validation for SMC-ABC. Points show estimated selection coefficients with 95% credible intervals (error bars) versus true values used in simulations. The dashed diagonal line indicates perfect recovery.

Parameter recovery analysis across 20 simulated datasets demonstrated that SMC-ABC reliably estimates selection coefficients. The method achieved 95% coverage, matching the nominal 95% level, with a mean absolute error of 0.063 and an RMSE of 0.085. Estimates showed minimal systematic bias (mean bias: 0.034), with points distributed evenly around the line of perfect recovery. The test range (s = -0.8 to -0.2) encompasses all posterior estimates observed across lineages (mean: -0.502 ± 0.099, range: -0.637 to -0.320), with conservative buffers extending 25-37% beyond the most extreme estimates. Parameter uncertainty increased slightly at stronger selection strengths (s < -0.6), reflecting the greater difficulty of precisely estimating extreme parameter values. These metrics validate our parameter estimation procedure for inferring selection coefficients from experimental data with characteristics similar to our observations.

Software and Statistical Analysis

All analyses were performed in R version 4.4.2 (R Core Team, 2024). Data manipulation used dplyr and tidyr (Wickham et al., 2019), visualization used ggplot2 (Wickham, 2016), mixed-effects models were fit with nlme (Pinheiro and Bates, 2000), and ABC analyses used EasyABC (Jabot et al., 2013) and abcrf (Pudlo et al., 2016; Raynal et al., 2019). Complete package versions are listed in Session Information.

Statistical tests were two-sided unless otherwise stated. Monte Carlo p-values were calculated as \((k + 1) / (n + 1)\), where \(k\) is the number of simulated values at least as extreme as the observed value and \(n\) is the number of simulations.
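As a minimal illustration of that formula (the function name `mc_pvalue` is ours, and it assumes larger values are more extreme):

```r
# Monte Carlo p-value: (k + 1) / (n + 1), where k counts simulated
# values at least as extreme as the observed statistic; the +1 terms
# keep the p-value strictly above zero for finite n
mc_pvalue <- function(simulated, observed) {
    k <- sum(simulated >= observed)
    (k + 1) / (length(simulated) + 1)
}

# 4 of 99 simulated values reach the observed statistic:
mc_pvalue(c(rep(0, 95), rep(5, 4)), observed = 5)  # (4 + 1) / (99 + 1) = 0.05
```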

References

Beaumont, M.A., Zhang, W., Balding, D.J., 2002. Approximate Bayesian computation in population genetics. Genetics 162, 2025–2035. https://doi.org/10.1093/genetics/162.4.2025
Jabot, F., Faure, T., Dumoulin, N., 2013. EasyABC: Efficient approximate Bayesian computation sampling schemes.
Johnston, I.G., Jones, N.S., 2016. Evolution of cell-to-cell variability in stochastic, controlled, heteroplasmic mtDNA populations. The American Journal of Human Genetics 99, 1150–1162. https://doi.org/10.1016/j.ajhg.2016.09.016
Lenormand, M., Jabot, F., Deffuant, G., 2013. Adaptive approximate Bayesian computation for complex models. Computational Statistics 28, 2777–2796. https://doi.org/10.1007/s00180-013-0428-3
Pinheiro, J.C., Bates, D.M., 2000. Mixed-effects models in S and S-PLUS. Springer, New York. https://doi.org/10.1007/978-1-4419-0318-1
Pudlo, P., Marin, J.-M., Estoup, A., Cornuet, J.-M., Gautier, M., Robert, C.P., 2016. Reliable ABC model choice via random forests. Bioinformatics 32, 859–866. https://doi.org/10.1093/bioinformatics/btv684
R Core Team, 2024. R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria.
Raynal, L., Marin, J.-M., Pudlo, P., Ribatet, M., Robert, C.P., Estoup, A., 2019. ABC random forests for Bayesian parameter inference. Bioinformatics 35, 1720–1728. https://doi.org/10.1093/bioinformatics/bty867
Silverman, B.W., 1986. Density estimation for statistics and data analysis, Monographs on statistics and applied probability. CRC press, London. https://doi.org/10.1201/9781315140919
Wickham, H., 2016. ggplot2: Elegant graphics for data analysis. Springer-Verlag New York. https://doi.org/10.1007/978-3-319-24277-4
Wickham, H., Averick, M., Bryan, J., Chang, W., McGowan, L.D., François, R., Grolemund, G., Hayes, A., Henry, L., Hester, J., Kuhn, M., Pedersen, T.L., Miller, E., Bache, S.M., Müller, K., Ooms, J., Robinson, D., Seidel, D.P., Spinu, V., Takahashi, K., Vaughan, D., Wilke, C., Woo, K., Yutani, H., 2019. Welcome to the tidyverse. Journal of Open Source Software 4, 1686. https://doi.org/10.21105/joss.01686
Wonnapinij, P., Chinnery, P.F., Samuels, D.C., 2010. Previous estimates of mitochondrial DNA mutation level variance did not account for sampling error: Comparing the mtDNA genetic bottleneck in mice and humans. American Journal of Human Genetics 86, 540–550. https://doi.org/10.1016/j.ajhg.2010.02.023
R version 4.4.2 (2024-10-31)
Platform: aarch64-apple-darwin20
Running under: macOS 26.1

Matrix products: default
BLAS:   /Library/Frameworks/R.framework/Versions/4.4-arm64/Resources/lib/libRblas.0.dylib 
LAPACK: /Library/Frameworks/R.framework/Versions/4.4-arm64/Resources/lib/libRlapack.dylib;  LAPACK version 3.12.0

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

time zone: Europe/London
tzcode source: internal

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] kableExtra_1.4.0.19 moments_0.14.1      abcrf_1.9          
 [4] EasyABC_1.5.2       abc_2.2.2           locfit_1.5-9.12    
 [7] MASS_7.3-61         quantreg_6.1        SparseM_1.84-2     
[10] nnet_7.3-19         abc.data_1.1        nlme_3.1-168       
[13] lattice_0.22-6      car_3.1-3           carData_3.0-5      
[16] RColorBrewer_1.1-3  ggridges_0.5.6      ggsignif_0.6.4     
[19] ggsci_3.2.0         lubridate_1.9.4     forcats_1.0.0      
[22] stringr_1.5.1       dplyr_1.1.4         purrr_1.1.0        
[25] readr_2.1.5         tidyr_1.3.1         tibble_3.3.0       
[28] ggplot2_3.5.2       tidyverse_2.0.0     janitor_2.2.1      
[31] readxl_1.4.5        here_1.0.1          conflicted_1.2.0   

loaded via a namespace (and not attached):
 [1] tidyselect_1.2.1   viridisLite_0.4.2  farver_2.1.2       fastmap_1.2.0     
 [5] tensorA_0.36.2.1   digest_0.6.37      timechange_0.3.0   lifecycle_1.0.4   
 [9] survival_3.7-0     magrittr_2.0.3     compiler_4.4.2     rlang_1.1.6       
[13] tools_4.4.2        yaml_2.3.10        knitr_1.50         labeling_0.4.3    
[17] htmlwidgets_1.6.4  mnormt_2.1.1       xml2_1.3.8         abind_1.4-8       
[21] withr_3.0.2        grid_4.4.2         scales_1.4.0       iterators_1.0.14  
[25] cli_3.6.5          rmarkdown_2.29     ragg_1.4.0         generics_0.1.4    
[29] rstudioapi_0.17.1  tzdb_0.5.0         cachem_1.1.0       splines_4.4.2     
[33] parallel_4.4.2     cellranger_1.1.0   matrixStats_1.5.0  vctrs_0.6.5       
[37] Matrix_1.7-1       jsonlite_2.0.0     hms_1.1.3          Formula_1.2-5     
[41] systemfonts_1.2.3  foreach_1.5.2      pls_2.8-5          glue_1.8.0        
[45] codetools_0.2-20   stringi_1.8.7      gtable_0.3.6       pillar_1.11.0     
[49] htmltools_0.5.8.1  R6_2.6.1           textshaping_1.0.1  lhs_1.2.0         
[53] doParallel_1.0.17  rprojroot_2.1.0    evaluate_1.0.4     memoise_2.0.1     
[57] snakecase_0.11.1   MatrixModels_0.5-4 Rcpp_1.1.0         svglite_2.2.1     
[61] ranger_0.17.0      xfun_0.52          pkgconfig_2.0.3